nigel.stanger / Wiki

Transcribing lectures using Whisper.md

Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing.

Whisper can often end up creating very short text chunks, leading to a kind of “rapid fire” effect.

Assuming a good quality recording, the following settings seem to do a good job:

* Medium model (specifically `medium.en`). This performs better than the large model when transcribing English because there is no English-specific variant of the `large` model.
* Enable word timestamps.
* Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”.
* VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
* Disable FP16 if running on Apple M1 or M2 as they don’t have support for it. Should be fine on everything else.
* An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.”
* Normalising the audio beforehand may or may not help. The `speechnorm` filter in FFmpeg seems quite effective, e.g., `ffmpeg -i <input> -filter:a speechnorm=e=12.5 <output>`.

For example:

```sh
whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file>
```

Offline transcription is roughly real-time (i.e., 1 hour of audio takes about 1 hour to transcribe).

Models are downloaded to `~/.cache/whisper`.

`whisper-cpp` is actually the one we want, as it’s written in C++ and supports Core ML. Annoyingly, the CLI options are different, but it seems to have more of them. It uses the same models as Vibe below. Input audio must be 16 kHz 🙁 (`ffmpeg -i <input> -vn -ar 16000 <output>`).

```sh
whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 --prompt "<prompt>" <input-file>
```

**Much** faster than the original `whisper`.
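The two steps (resample to 16 kHz, then transcribe) can be wrapped up per recording. This is only a rough sketch: the `run` and `transcribe` helpers, the `DRY_RUN`, `MODEL` and `PROMPT` variables, and the `-16k.wav` naming are all made up for illustration. The `ffmpeg` and `whisper-cpp` options are the same ones used above.

```sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: resample one recording to 16 kHz, then transcribe it.
# Assumes MODEL (path to a ggml model file) and PROMPT are set by the caller.
transcribe() {
    input=$1
    # Name the intermediate file after the input, minus its extension.
    wav="${input%.*}-16k.wav"
    run ffmpeg -i "$input" -vn -ar 16000 "$wav" &&
    run whisper-cpp --model "$MODEL" --language en --output-vtt \
        --max-len 150 --prompt "$PROMPT" "$wav"
}
```

Setting `DRY_RUN=1` before calling `transcribe lecture.mp4` should just print the two commands, which is handy for checking paths before committing to a long run.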

**Vibe** seems to be a useful cross-platform GUI implementation. Internally, it is whisper-cpp ported to Rust. It seems to produce very short line lengths. It claims to have a CLI, but I can’t figure out how to make it work, and the “max sentence length” setting doesn’t seem to help (ohhhh, it’s measured in **characters**, not words, duh 😖). It produces not-quite-correct VTT: no WEBVTT header, and no blank line between entries.
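The two VTT problems look fixable with a small filter. The sketch below assumes that every cue starts with a timing line containing ` --> ` and that there are no cue identifier lines before the timings. That is a guess at what Vibe emits and has not been checked against every case. The `fix_vtt` name is made up.

```sh
# Hypothetical helper: add the missing WEBVTT header and make sure a blank
# line precedes every cue timing line. Reads the named file(s), or stdin.
fix_vtt() {
    awk '
        # Add the header if the first line is not already one.
        NR == 1 && $0 !~ /^WEBVTT/ { print "WEBVTT"; prev = "WEBVTT" }
        # Separate cues: emit a blank line unless the previous line was blank.
        / --> / && prev != "" { print "" }
        { print; prev = $0 }
    ' "$@"
}
```

Something like `fix_vtt lecture.vtt > lecture-fixed.vtt` should then do the job. Running it on a file that is already valid should leave it unchanged.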

VS Code extension issues:

* Missing feature: merge subtitles. Merges selected subtitles into one and adjusts timestamps accordingly.
* Bug: Sometimes adjusting timing leads to timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic.
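Until the timestamp bug is fixed, overflowed values can be repaired by converting to total milliseconds and back, so that the carry propagates properly. This is a minimal sketch with a made-up `normalise_ts` helper. It assumes the last field is an integer count of milliseconds that has overflowed, which is what the example above suggests.

```sh
# Hypothetical helper: normalise an HH:MM:SS.mmm timestamp whose fields may
# have overflowed (e.g. a milliseconds field of 1000).
normalise_ts() {
    printf '%s\n' "$1" | awk -F'[:.]' '{
        # Collapse everything into milliseconds...
        total = (($1 * 60 + $2) * 60 + $3) * 1000 + $4
        # ...then split back out, carrying into each larger unit.
        ms = total % 1000; total = int(total / 1000)
        s  = total % 60;   total = int(total / 60)
        m  = total % 60;   h = int(total / 60)
        printf "%02d:%02d:%02d.%03d\n", h, m, s, ms
    }'
}
```

For example, `normalise_ts 00:40:48.1000` prints `00:40:49.000`.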